Abstract

Ontology-based data access is an approach to organizing access to 
a database augmented with a logical theory. In this approach query 
answering proceeds through a reformulation of a given query into a 
new one which can be answered without any use of theory. Thus the 
problem reduces to the standard database setting. 

However, the size of the query may increase substantially during the
reformulation. In this survey we review a recently developed framework
for proving lower and upper bounds on the size of this reformulation
by employing methods and results from Boolean circuit complexity.


1 Introduction 

Ontology-based data access is an approach to storing and accessing data
in a database. In this approach the database is augmented with a first-order
logical theory; that is, the database is viewed as a set of predicates on
elements (entities) of the database, and the theory contains some universal
statements about these predicates.

The idea of augmenting data with a logical theory has been around since
at least the 1970s (the Prolog programming language, for example, is in this
flavor [?]). However, this idea had to constantly overcome implementation
issues. The main difficulty is that if the theory accompanying the data is
too strong, then even standard algorithmic tasks become computationally
intractable.

One of these basic algorithmic problems will be of key interest to us, 
namely the query answering problem. A query to a database seeks all
elements in the data with certain properties. In case the data is augmented 
with a theory, query answering cannot be handled directly with the same 
methods as for usual databases and new techniques are required. 

Thus, on the one hand, we would like a logical theory to help us in some
way and, on the other hand, we need to avoid the computational
complications that arise.

Ontology-based data access (OBDA for short) is a recent approach in
this direction developed since around 2005 [?]. Its main purpose
is to help maintain large and distributed data and to make working with
the data more user-friendly. The logical theory helps in achieving this goal
by allowing one to create a convenient language for queries, hiding details
of the structure of the data source, and supporting queries to distributed
and heterogeneous data sources. Another important property is that data does
not have to be complete. Some information may follow from the theory
and not be present in the data explicitly.

A key advantage of OBDA is that to achieve these goals, it is often enough 
in practice to supplement the data with a rather primitive theory. This is 
important for the query answering problem: the idea of OBDA from the 
algorithmic point of view is not to develop a new machinery, but to reduce 
query answering with a theory to the standard database query answering 
and use the already existing machinery. 

The most standard approach to this is to first reformulate a given query 
in such a way that the answer to the new query does not depend on the theory 
anymore. This reformulation is usually called a rewriting of the query. The 
rewriting should be the same for any data in the database. Once the
rewriting is built we can apply standard methods of database theory. Naturally,
however, the length of the query typically increases during the reformulation 
and this might make this approach (at least theoretically) inefficient. 

The main issue we address in this survey is how large the rewriting can 
be compared to the size of the original query. Ideally, it would be nice if the 
size of the rewriting is polynomial in the size of the original query. In this 
survey we will discuss why rewritings can grow exponentially in some cases 
and how Boolean circuit complexity helps us to obtain results of this kind. 

In this survey we will confine ourselves to data consisting only of unary 
and binary predicates over the database elements. If data contains predicates 
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of larger arity, the latter can be represented via binary predicates. Such 
representations are called mappings in this field and there are several ways 
for doing this. We leave the discussion of mappings aside and refer the reader 
to |18| and references therein. We call a data source with unary and binary 
predicates augmented with a logical theory a knowledge base. 

As mentioned above, in OBDA only very restricted logical theories are
considered. There are several standard families of theories, including
OWL 2 QL [20] and several fragments of Datalog± [?]. The lower bounds on
the size of rewritings we are going to discuss work for even weaker theories 
contained in all families mentioned above. The framework we describe also 
allows one to prove upper bounds on the size of the rewritings that work for 
theories given in OWL 2 QL. We will describe the main ideas for obtaining 
upper bounds, but will not discuss them in detail. 

To give a complete picture of our setting, we also need to discuss the
types of queries and rewritings we consider. The standard type of queries
(as logical formulas) considered in this field is conjunctive queries, i.e.
conjunctions of atomic formulas prefixed by existential quantifiers. In this
survey we will discuss only this type of query.

As for rewritings, it does not make sense to consider conjunctive formulas
as rewritings, since their expressive power is rather poor. The simplest type
of rewriting that is powerful enough to provide a rewriting for every query
is a DNF-rewriting, which is a disjunction of conjunctions with a prefix of
existential quantifiers. However, it is not hard to show (see [?]) that this
type of rewriting may be exponentially larger than the original query. More
general standard types of rewritings are first-order (FO-) rewritings, where
a rewriting can be an arbitrary first-order formula; positive existential
(PE-) rewritings, which are first-order formulas containing only existential
quantifiers and no negations (this type of rewriting is motivated by its
convenience for standard databases); and nonrecursive datalog (NDL-)
rewritings, which are not first-order formulas but rather are constructed in
a more circuit-flavored way (see Section 3 for details).

For these more general types of rewritings it is not easy to see how the
size of the rewriting grows with the size of the original query. Progress on
this question started with the paper [?], where it was shown that a
polynomial size FO-rewriting cannot be constructed in polynomial time,
unless P = NP.

Soon after that, the approach of that paper was extended in [?] to give a
much stronger result: not only is there no way to construct an FO-rewriting
in polynomial time, but there is not even a polynomial size FO-rewriting,
unless NP $\subseteq$ P/poly. It was also shown (unconditionally!) in [15, ?]
that there are queries and theories for which the shortest PE- and
NDL-rewritings are exponential in the size of the original query. They also
obtained an exponential separation between PE- and NDL-rewritings and a
superpolynomial separation between PE- and FO-rewritings.

These results were obtained in [?] by reducing the problem of lower
bounding the rewriting size to some problems in computational complexity
theory. Basically, the idea is that we can encode a Boolean function
$f \in$ NP into a query q and design the query and the theory in such a way
that an FO-rewriting of q will provide us with a Boolean formula for $f$, a
PE-rewriting of q will correspond to a monotone Boolean formula, and an
NDL-rewriting to a monotone Boolean circuit. Then by choosing an appropriate
$f$ and applying known results from circuit complexity theory, we can deduce
the lower bounds on the sizes of the rewritings.

The next step in this line of research was to study the size of rewritings 
for restricted types of queries and knowledge bases. A natural subclass of 
conjunctive queries is the class of tree-like queries. To define this class, for 
a given query consider a graph whose vertices are the variables of the query 
and an edge connects two variables if they appear in the same predicate 
in the query. We say that a query is tree-like if this graph is a tree. A
natural way to restrict theories of knowledge bases is to consider their depth.
Informally, the theory is of depth d if, starting with the data and generating
all new objects whose existence follows from the given theory, we will not
obtain in the resulting underlying graph any sequences of new objects of
length greater than d. These kinds of restrictions on queries and theories
are motivated by practical reasons: they are met in the vast majority of
applications of knowledge bases. On the other hand, in the papers [?]
non-constant depth theories were used to prove lower bounds on the size of
rewritings.

Subsequent papers [16, ?] managed to describe a complete picture of
the sizes of the rewritings in the restricted cases described above. To obtain
these results, they determined, for each case mentioned above, the class of
Boolean functions $f$ that can be encoded by queries and theories of the
corresponding types. This establishes a close connection between
ontology-based data access and various classes in Boolean circuit complexity.
Together with known results in Boolean circuit complexity, this connection
allows one to show various lower and upper bounds on the sizes of rewritings
in all described cases. The precise formulation of these results is given in
Section [?].

To obtain their results, [?] also introduced a new intermediate
computational model, the hypergraph programs, which might be of independent
interest. A hypergraph program consists of a hypergraph whose vertices are
labeled by Boolean constants, input variables $x_1, \ldots, x_n$ or their
negations. On a given input $x \in \{0,1\}^n$, a hypergraph program outputs 1
iff all its vertices whose labels are evaluated to 0 on this input can be
covered by a set of disjoint hyperedges. We say that a hypergraph program
computes $f\colon \{0,1\}^n \to \{0,1\}$ if it outputs $f(x)$ on every input
$x \in \{0,1\}^n$. The size of a hypergraph program is the number of vertices
plus the number of hyperedges in it.
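
The definition above can be made concrete with a small brute-force evaluator. This is only a sketch under our own assumptions: the encoding of labels as pairs ("const", b), ("var", i), ("neg", i) with 0-based variable indices and the function names are hypothetical, and "covered" is read as "contained in the union of the chosen hyperedges", so chosen hyperedges may also contain vertices evaluated to 1.

```python
from itertools import combinations, product


def label_value(label, x):
    """Evaluate a vertex label ("const", b), ("var", i) or ("neg", i) on input x."""
    kind, arg = label
    if kind == "const":
        return arg
    if kind == "var":
        return x[arg]
    if kind == "neg":
        return 1 - x[arg]
    raise ValueError(f"unknown label kind: {kind}")


def hypergraph_program_output(labels, hyperedges, x):
    """Return 1 iff the vertices evaluated to 0 can be covered by disjoint hyperedges."""
    zeros = {v for v, label in enumerate(labels) if label_value(label, x) == 0}
    edges = [frozenset(e) for e in hyperedges]
    # Try every subset of hyperedges (exponential time; for illustration only).
    for k in range(len(edges) + 1):
        for chosen in combinations(edges, k):
            covered = set().union(*chosen)
            # The chosen hyperedges are pairwise disjoint iff no vertex is counted twice.
            pairwise_disjoint = len(covered) == sum(len(e) for e in chosen)
            if pairwise_disjoint and zeros <= covered:
                return 1
    return 0


# A program with 3 vertices and 2 hyperedges (size 5) for the OR of two variables:
# the two hyperedges overlap in vertex 1, so at most one of them can be chosen.
or_labels = [("var", 0), ("const", 1), ("var", 1)]
or_hyperedges = [{0, 1}, {1, 2}]
or_table = {x: hypergraph_program_output(or_labels, or_hyperedges, x)
            for x in product((0, 1), repeat=2)}
```

In this toy program the overlap of the two hyperedges is what encodes the disjunction: both zero vertices can never be covered at once, so the output is 0 only on the all-zero input.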

The papers [?] studied the power of hypergraph programs and their
restricted versions. As it turns out, the class of functions computable by
general hypergraph programs of polynomial size coincides with NP/poly [16].
The same is true for hypergraph programs of degree at most 3, that is, for
programs in which the degree of each vertex is bounded by 3. The class of
functions computable by polynomial size hypergraph programs of degree at
most 2 coincides with NL/poly [16]. Another interesting case is that of
tree hypergraph programs, which have an underlying tree and all of whose
hyperedges consist of subtrees. Tree hypergraph programs turn out to be
equivalent to SAC$^1$ circuits [1]. If the underlying tree is a path, then
polynomial size hypergraph programs compute precisely the functions in
NL/poly [1].

The rest of the survey is organized as follows. In Section 2 we give the
necessary definitions from Boolean circuit complexity. In Section 3 we give
the necessary definitions and basic facts on knowledge bases. In Section 4
we describe the main idea behind the proofs of bounds on the size of the
rewritings. In Section 5 we introduce hypergraph programs and explain how
they help to bound the size of the rewritings. In Section 6 we discuss the
complexity of hypergraph programs.

2 Boolean circuits and other computational models 

In this section we provide the necessary information on Boolean circuits,
other computational models and related complexity classes. For more details
see [13].

A Boolean circuit C is an acyclic directed graph. Each vertex of the
graph is labeled by either a variable among $x_1, \ldots, x_n$, or a constant
0 or 1, or a Boolean function $\neg$, $\wedge$ or $\vee$. Vertices labeled by
variables and constants have in-degree 0, vertices labeled by $\neg$ have
in-degree 1, vertices labeled by $\wedge$ and $\vee$ have in-degree 2.
Vertices of a circuit are called gates. Vertices labeled by variables or
constants are called input gates. For each non-input gate g its inputs are
the gates which have out-going edges to g. One of the gates in a circuit is
labeled as an output gate. Given $x \in \{0,1\}^n$, we can assign a value to
each gate of the circuit inductively. The value of each input gate is equal
to the value of the corresponding variable or constant. The value of a
$\neg$-gate is opposite to the value of its input. The value of a
$\wedge$-gate is equal to 1 iff both its inputs are 1. The value of a
$\vee$-gate is 1 iff at least one of its inputs is 1. The value of the
circuit $C(x)$ is defined as the value of its output gate on
$x \in \{0,1\}^n$. A circuit C computes a function
$f\colon \{0,1\}^n \to \{0,1\}$ iff $C(x) = f(x)$ for all $x \in \{0,1\}^n$.
The size of a circuit is the number of gates in it.
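
As an illustration of the inductive evaluation just described, the following sketch evaluates a circuit given as a dictionary of gates. The encoding (gate names as strings, tuples such as ("and", g1, g2), 0-based variable indices) is our own assumption and not part of the definition.

```python
from itertools import product


def evaluate_circuit(gates, output, x):
    """Evaluate a Boolean circuit on the input bits x.

    gates maps a gate name to ("var", i), ("const", b), ("not", g),
    ("and", g1, g2) or ("or", g1, g2), where g, g1, g2 are gate names.
    """
    values = {}  # memoized gate values, so shared gates are evaluated once

    def value(g):
        if g not in values:
            kind, *args = gates[g]
            if kind == "var":
                values[g] = x[args[0]]
            elif kind == "const":
                values[g] = args[0]
            elif kind == "not":
                values[g] = 1 - value(args[0])
            elif kind == "and":
                values[g] = value(args[0]) & value(args[1])
            elif kind == "or":
                values[g] = value(args[0]) | value(args[1])
            else:
                raise ValueError(f"unknown gate kind: {kind}")
        return values[g]

    return value(output)


# A circuit of size 7 computing the parity (XOR) of two inputs.
xor_gates = {
    "x0": ("var", 0), "x1": ("var", 1),
    "n0": ("not", "x0"), "n1": ("not", "x1"),
    "a": ("and", "x0", "n1"), "b": ("and", "n0", "x1"),
    "out": ("or", "a", "b"),
}
xor_size = len(xor_gates)
xor_table = {x: evaluate_circuit(xor_gates, "out", x) for x in product((0, 1), repeat=2)}
```

Note that the input gates x0 and x1 each feed two gates, so this circuit is not a formula in the sense defined below; the recursion is adequate only for small examples.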

The number of inputs n is a parameter. Instead of individual functions,
we consider sequences of functions $f = \{f_n\}_{n \in \mathbb{N}}$, where
$f_n\colon \{0,1\}^n \to \{0,1\}$. A sequence of circuits
$C = \{C_n\}_{n \in \mathbb{N}}$ computes $f$ iff $C_n$ computes $f_n$ for
all n. From now on, by a Boolean function or a circuit we always mean a
sequence of functions or circuits.

A formula is a Boolean circuit such that each of its gates has fan-out 1.
A Boolean circuit is monotone iff there are no negations in it. It is easy to
see that any monotone circuit computes a monotone Boolean function and,
on the other hand, any monotone Boolean function can be computed by a
monotone Boolean circuit.

A circuit C is a polynomial size circuit (or just polynomial circuit) if
there is a polynomial $p \in \mathbb{Z}[x]$ such that the size of $C_n$ is at
most $p(n)$.

Now we are ready to define several complexity classes based on circuits.
A Boolean function $f$ lies in the class P/poly iff there is a polynomial size
circuit C computing $f$. A Boolean function $f$ lies in the class NC$^1$ iff
there is a polynomial size formula C computing $f$. A Boolean function $f$
lies in the class NP/poly iff there is a polynomial $p(n)$ and a polynomial
size circuit C such that for all n and for all $x \in \{0,1\}^n$

$$f(x) = 1 \iff \exists y \in \{0,1\}^{p(n)}\colon\ C_{n+p(n)}(x, y) = 1. \qquad (1)$$

The complexity classes P/poly and NP/poly are nonuniform analogs of P and NP.

We can introduce monotone analogs of P/poly and NC$^1$ by considering
only monotone circuits or formulas. In the monotone version of NP/poly it
is only allowed to apply negations directly to y-inputs.

The depth of a circuit is the length of the longest directed path from
an input to the output of the circuit. It is known that $f \in$ NC$^1$ iff $f$
can be computed by a logarithmic depth circuit [?]. By SAC$^1$ we denote the
class of all Boolean functions $f$ computable by a polynomial size logarithmic
depth circuit such that $\vee$-gates are allowed to have arbitrary fan-in and
all negations are applied only to inputs of the circuit [24].

A nondeterministic branching program P is a directed graph $G = (V, E)$,
with edges labeled by Boolean constants, variables $x_1, \ldots, x_n$ or their
negations. There are two distinguished vertices of the graph named s and t.
On an input $x \in \{0,1\}^n$ a branching program P outputs $P(x) = 1$ iff
there is a path from s to t going through edges whose labels evaluate to 1. A
nondeterministic branching program P computes a function
$f\colon \{0,1\}^n \to \{0,1\}$ iff for all $x \in \{0,1\}^n$ we have
$P(x) = f(x)$. The size of a branching program is the number of its vertices
plus the number of its edges, $|V| + |E|$. A branching program is monotone if
there are no negated variables among labels.
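
Evaluating a nondeterministic branching program amounts to a reachability check in the subgraph of edges whose labels evaluate to 1. The sketch below illustrates this; the edge encoding (triples (u, v, label) with labels ("const", b), ("var", i), ("neg", i) and 0-based variable indices) is a hypothetical choice made for the example.

```python
from collections import deque
from itertools import product


def label_value(label, x):
    """Evaluate an edge label ("const", b), ("var", i) or ("neg", i) on input x."""
    kind, arg = label
    if kind == "const":
        return arg
    if kind == "var":
        return x[arg]
    if kind == "neg":
        return 1 - x[arg]
    raise ValueError(f"unknown label kind: {kind}")


def branching_program_output(edges, s, t, x):
    """Return 1 iff t is reachable from s along edges whose labels evaluate to 1."""
    active = {}
    for u, v, label in edges:
        if label_value(label, x) == 1:
            active.setdefault(u, []).append(v)
    seen, queue = {s}, deque([s])
    while queue:  # breadth-first search from s
        u = queue.popleft()
        if u == t:
            return 1
        for v in active.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return 0


# A monotone program with 3 vertices and 3 edges (size 6) for (x_0 AND x_1) OR x_2.
bp_edges = [("s", "m", ("var", 0)), ("m", "t", ("var", 1)), ("s", "t", ("var", 2))]
bp_table = {x: branching_program_output(bp_edges, "s", "t", x)
            for x in product((0, 1), repeat=3)}
```

Paths in series model conjunction and parallel paths model disjunction, which is why the example needs no negated labels.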

Just as for functions and circuits, from now on by a branching program
we mean a sequence of branching programs $P_n$ with n variables for all
$n \in \mathbb{N}$.

A branching program P is a polynomial size branching program if there
is a polynomial $p \in \mathbb{Z}[x]$ such that the size of $P_n$ is at most
$p(n)$.

A Boolean function $f$ lies in the class NBP iff there is a polynomial
size branching program P computing $f$. It is known that NBP coincides
with the nonuniform analog of nondeterministic logarithmic space NL, that is,
NBP = NL/poly [?, 23].

For every complexity class K introduced above, we denote by mK its 
monotone counterpart. 

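The following inclusions hold between the classes introduced above [?]:

$$\mathrm{NC}^1 \subseteq \mathrm{NBP} \subseteq \mathrm{SAC}^1 \subseteq \mathrm{P/poly} \subseteq \mathrm{NP/poly}. \qquad (2)$$

It is a major open problem in computational complexity whether any of these
inclusions is strict.

Similar inclusions hold for the monotone case:

$$\mathrm{mNC}^1 \subseteq \mathrm{mNBP} \subseteq \mathrm{mSAC}^1 \subseteq \mathrm{mP/poly} \subseteq \mathrm{mNP/poly}. \qquad (3)$$

It is also known that mP/poly $\neq$ mNP/poly [22, ?] and mNBP $\neq$ mNC$^1$ [?].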
We will use these facts to prove lower bounds on the rewriting size. 

3 Theories, queries and rewritings 

In this survey a data source is viewed as a first-order theory. It is not an 
arbitrary theory and must satisfy some restrictions, which we specify below. 

First of all, in order to specify the structure of data, we need to fix a set
of predicate symbols in the signature. Informally, they correspond to the
types of information the data contains. We assume that there are only unary
and binary predicates in the signature. The data itself consists of a set of
objects (entities) and of information on them. Objects in the data correspond
to constants of the signature. The information in the data corresponds to
closed atomic formulas, that is, predicates applied to constants. These
formulas constitute the theory corresponding to the data. We denote the
resulting set of formulas by D and the set of constants in the signature by
$\Delta_D$.

We denote the signature (the set of predicate symbols and constants) by
$\Sigma$. Thus, we have translated a data source into logical terms. To
obtain a knowledge base, we introduce more complicated formulas into the
theory. The set of these formulas will be denoted by T and called an
ontology. We will describe which formulas may be present in T a bit later.
The theory $D \cup T$ is called a knowledge base. The predicate symbols and
the theory T determine the structure of the knowledge base and thus will be
fixed. The constants $\Delta_D$ and the atomic formulas D, on the other hand,
determine the current content of the data, so they will vary.

As we mentioned in the Introduction, we will consider only conjunctive
queries. That is, a query is a formula of the form

$$q(x) = \exists y\, \varphi(x, y),$$

where $\varphi$ is a conjunction of atomic formulas (or atoms for short). For
simplicity we will assume that q does not contain constants from $\Delta_D$.

What does query answering mean for standard data sources without an
ontology? It means that there are values for x and y among $\Delta_D$ such
that the query becomes true on the given data D. That is, we can consider a
model $I_D$ corresponding to the data D. The elements of the model $I_D$ are
the constants in $\Delta_D$, and the values of predicates in $I_D$ are given
by the formulas in D. That is, a predicate $P \in \Sigma$ is true on a from
$\Delta_D$ iff $P(a) \in D$. A tuple of elements a of $I_D$ is an answer to
the query $q(x)$ if

$$I_D \models \exists y\, \varphi(a, y).$$
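
For data without an ontology, this notion of an answer can be checked directly by trying all assignments of constants to the variables of the query. The sketch below does exactly that; the representation of atoms as tuples (predicate, argument, ...) and the function name are our own assumptions, and no attempt is made at the efficiency of a real database engine.

```python
from itertools import product


def answers(query_atoms, answer_vars, data):
    """Return all answers to a conjunctive query over the model I_D.

    query_atoms: list of atoms over variables, e.g. ("R", "x", "y")
    answer_vars: the free variables x of the query, in order
    data:        set of ground atoms, e.g. ("R", "a", "b")
    """
    constants = sorted({c for atom in data for c in atom[1:]})
    variables = sorted({v for atom in query_atoms for v in atom[1:]})
    result = set()
    # Brute force: try every assignment of constants to the query variables.
    for values in product(constants, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        ground = [(a[0],) + tuple(assignment[v] for v in a[1:]) for a in query_atoms]
        if all(atom in data for atom in ground):
            result.add(tuple(assignment[v] for v in answer_vars))
    return result


# q(x) = exists y (R(x, y) and A(y)) over a toy data set.
toy_data = {("R", "a", "b"), ("R", "b", "c"), ("A", "b")}
toy_answers = answers([("R", "x", "y"), ("A", "y")], ["x"], toy_data)
```

With an empty list of answer variables the function returns the set containing the empty tuple when a Boolean query holds and the empty set otherwise.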

Let us go back to our setting. Now we consider data augmented with a
logical theory. This means that we do not have a specific model. Instead,
we have a theory and we need to find out whether the query is satisfied in
the theory. That is, the problem we are interested in is, given a knowledge
base $D \cup T$ and a query $q(x)$, to find a in $\Delta_D$ such that

$$D \cup T \models q(a).$$

If x is an empty tuple of variables, then the answer to the query is ‘yes’ or
‘no’. In this case we say that the query is Boolean.

The main approach to solving the query answering problem is to first
reformulate the query in such a way that the answer to the new query does
not depend on the theory T and then apply the machinery for standard
databases. This leads us to the following definition. A first-order formula
$q'(x)$ is called a rewriting of $q(x)$ w.r.t. a theory T if

$$D \cup T \models q(a) \iff I_D \models q'(a) \qquad (4)$$

for all D and for all a. We emphasize that on the left-hand side in (4) the
symbol ‘$\models$’ means logical consequence from a theory, while on the
right-hand side it means truth in a model.

We also note that in (4) only the predicate symbols in $\Sigma$ and the
theory T are fixed. The theory D (and thus the set of constants in the
signature) may vary, so the rewriting should work for any data D.
Intuitively, this means that the structure of the data is fixed in advance
and known, while the current content of a knowledge base may change. We would
like the rewriting (and thus the query answering approach) to work no matter
how the data change.

What corresponds to a model of the theory $D \cup T$? Since the data D
is not assumed to be complete, it is not a model. A model corresponds
to the content of the “real life” complete data, which extends the data D.
We assume that all formulas of the theory hold in the model, that is, all
information in the knowledge base (including formulas in T) is correct.

However, if we allow overly strong formulas in our ontology, then the
problem of query answering becomes algorithmically intractable. So we
have to allow only very restricted formulas in T. On the other hand, for the
practical goals of OBDA only very simple formulas are required anyway.

There are several ways to restrict theories in knowledge bases. We will
use one that fits all of the most popular restrictions. Thus our lower bounds
will hold for most of the settings considered. As for the upper bounds, we
will not discuss them in detail; however, we mention that they hold for
substantially stronger theories and cover the OWL 2 QL framework [20].

Formulas in the ontology T are restricted to the following form:

$$\forall x\, (\varphi(x) \to \exists y\, \psi(x, y)), \qquad (5)$$

where x and y are (single) variables, $\varphi$ is a unary predicate and
$\psi(x, y)$ is a conjunction of atomic formulas.

It turns out that if T consists only of formulas of the form (5), then a
rewriting always exists. The (informal) reason for this is that in this
case there is always a universal model for given D and T.

Theorem 1. For all theories D, T such that T consists of formulas of the
form (5) there is a model $M_D$ such that

$$D \cup T \models q(a) \iff M_D \models q(a)$$

for any conjunctive query q and any a.

Remark. Note that the model $M_D$ actually depends on both D and T. We
do not add T as a subscript since in our setting T is fixed and D varies.

The informal meaning of this theorem is that for ontologies T specified
by (5) there is always a most general model. More formally, for any other
model M of $D \cup T$ there is a homomorphism from the universal model $M_D$
to M. We provide a sketch of the proof of this theorem. For us it will be
useful to see how the model $M_D$ is constructed.

Proof sketch. The informal idea for the existence of the universal model is
that we can reconstruct it from the constants present in the data D.
Namely, first we add to $M_D$ all constants in $\Delta_D$ and we let all
atomic formulas in D be true on them. Next, from the theory T it might follow
that some other predicates should hold on the constants in $\Delta_D$. We
also let them be true in $M_D$. More importantly, formulas in T might also
imply the existence of new elements related to constants (the formula (5)
implies, for elements x that satisfy $\varphi(x)$, the existence of a new
element y). We add these new elements to the model and extend predicates on
them by deducing everything that follows from T. Next, T may imply the
existence of further elements that are connected to the ones obtained in the
previous step. We keep adding them to the model. It is not hard to see that
the resulting (possibly infinite) model is indeed the universal model. We
omit the formal proof of this and refer the reader to [?]. □
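
The construction in the proof sketch can be imitated by a simple procedure that repeatedly applies the formulas of T and introduces a fresh element for each existential quantifier. The following is only a sketch under our own assumptions: axioms of the form (5) are encoded as pairs (premise, conclusion) over the fixed variable names "x" and "y", fresh elements get hypothetical names _n1, _n2, ..., a new witness is created for every applicable axiom even if a suitable element already exists, and the construction is cut off at a given depth because the universal model may be infinite.

```python
from itertools import count


def universal_model(data, ontology, max_depth):
    """Build the part of the universal model obtained in at most max_depth steps.

    data:     set of ground atoms such as ("Student", "c") or ("worksOn", "c", "b")
    ontology: list of (premise, conclusion) pairs encoding
              forall x (premise(x) -> exists y conclusion(x, y)), where the
              conclusion is a list of atoms over the variable names "x" and "y"
    """
    model = set(data)
    depth = {c: 0 for atom in data for c in atom[1:]}  # constants have depth 0
    applied = set()  # (axiom index, element) pairs that were already used
    fresh = count(1)
    changed = True
    while changed:
        changed = False
        for atom in sorted(model):
            if len(atom) != 2:  # only unary atoms can trigger an axiom
                continue
            pred, elem = atom
            for i, (premise, conclusion) in enumerate(ontology):
                if premise != pred or (i, elem) in applied:
                    continue
                needs_witness = any("y" in a[1:] for a in conclusion)
                if needs_witness and depth[elem] >= max_depth:
                    continue  # cut-off: do not introduce elements beyond max_depth
                applied.add((i, elem))
                witness = None
                if needs_witness:
                    witness = f"_n{next(fresh)}"
                    depth[witness] = depth[elem] + 1
                binding = {"x": elem, "y": witness}
                model |= {(a[0],) + tuple(binding[v] for v in a[1:]) for a in conclusion}
                changed = True
    return model


# Two axioms of the form (5) and a single data atom (a toy instance of our own).
toy_ontology = [
    ("Student", [("worksOn", "x", "y"), ("Project", "y")]),
    ("Project", [("isManagedBy", "x", "y"), ("Professor", "y")]),
]
toy_model = universal_model({("Student", "c")}, toy_ontology, max_depth=5)
```

On this input the procedure adds a project for the student c and then a professor managing that project, after which no axiom is applicable; with max_depth=1 only the project would be added.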

So, instead of considering a query q over $D \cup T$ we can consider it over
$M_D$. This observation helps to study rewritings.

It is instructive to consider the graph underlying the model $M_D$. The
vertices of the graph are elements of the model and there is a directed edge
from an element $m_1$ to an element $m_2$ if there is a binary predicate P
such that $M_D \models P(m_1, m_2)$. Then in the process above we start with
a graph on constants from $\Delta_D$ and then add new vertices whose
existence follows from T. Note that the premise of the formula (5) consists
of a unary predicate. This means that the existence of a new element in the
model is implied solely by one unary predicate that holds on one of the
already constructed vertices. Thus each new vertex of the model can be traced
back to one of the constants a of the theory and one of the atomic formulas
$B(a) \in D$.

The maximal (over all D) number of steps of introducing new elements
to the model is called the depth of the theory T. This parameter will be of
interest to us. We note that $M_D$ and thus the depth of T are not
necessarily finite.

In what follows it is useful to consider, for each unary predicate
$A \in \Sigma$, the universal model $M_D$ for the theory $D = \{A(a)\}$. As we
mentioned, the universal model for an arbitrary D is “built up” from these
simple universal models. We denote this model by $M_A$ (instead of
$M_{\{A(a)\}}$) and call it the universal tree generated by A. The vertex a
in the corresponding graph is called the root of the universal tree. All
other vertices of the tree are called inner vertices. To justify the name
“tree” we note that the underlying graph of $M_A$ is a tree in all
interesting cases, though not in all cases. More precisely, it might fail to
be a tree if some formula (5) in T does not contain any binary predicate
$R(x, y)$.

Example 1. To illustrate, consider an ontology T describing a part of a
student projects organization:

$$\forall x\, (\mathit{Student}(x) \to \exists y\, (\mathit{worksOn}(x, y) \wedge \mathit{Project}(y))),$$

$$\forall x\, (\mathit{Project}(x) \to \exists y\, (\mathit{isManagedBy}(x, y) \wedge \mathit{Professor}(y))),$$

$$\forall x, y\, (\mathit{worksOn}(x, y) \to \mathit{involves}(y, x)),$$

$$\forall x, y\, (\mathit{isManagedBy}(x, y) \to \mathit{involves}(x, y)).$$

Some formulas in this theory are of a form different from (5), but this will
not be important to us. Moreover, it is not hard to see that this theory can
be reduced to the form (5) (along with small changes in the data).

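Consider the query $q(x)$ asking to find those who work with professors:

$$q(x) = \exists y, z\, (\mathit{worksOn}(x, y) \wedge \mathit{involves}(y, z) \wedge \mathit{Professor}(z)). \qquad (6)$$

It is not hard to check that the following formula is a rewriting of q:

$$\begin{aligned}
q'(x) ={} & \exists y, z\, \big[\mathit{worksOn}(x, y) \wedge (\mathit{worksOn}(z, y) \vee \mathit{isManagedBy}(y, z) \vee \mathit{involves}(y, z)) \wedge \mathit{Professor}(z)\big] \\
& \vee\ \exists y\, \big[\mathit{worksOn}(x, y) \wedge \mathit{Project}(y)\big] \vee \mathit{Student}(x).
\end{aligned}$$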

That is, for any data D and any constant a in D, we have

$$D \cup T \models q(a) \iff I_D \models q'(a).$$

To illustrate the universal model, consider the data

$$D = \{\mathit{Student}(c),\ \mathit{worksOn}(c, b),\ \mathit{Project}(b),\ \mathit{isManagedBy}(b, a)\}.$$

The universal model $M_D$ is presented in Fig. 1. The left region corresponds
to the data D, the upper right region corresponds to the universal tree
generated by Project(b) and the lower right region corresponds to the
universal tree generated by Student(c). A label of the form $P^-$ on an edge,
where P is a predicate of the signature, means that there is an edge in the
opposite direction labeled by P.

Figure 1: An example of a universal model

We note that for our query $q(x)$ we have that $q(c)$ follows from $D \cup T$
and we can see that the rewriting $q'(c)$ is true in $M_D$. Note, however,
that $q(c)$ is not true in D due to the incompleteness of the data D: it is
not known that a is a professor.
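
The difference between q and its rewriting q' on this particular data can be verified mechanically. The sketch below hard-codes the data D of Example 1 and evaluates both formulas directly over the constants of D, without using the ontology; the helper names holds, q and q_prime are ours.

```python
# The data D from Example 1, as a set of ground atoms.
D = {
    ("Student", "c"), ("worksOn", "c", "b"),
    ("Project", "b"), ("isManagedBy", "b", "a"),
}
CONSTANTS = sorted({c for atom in D for c in atom[1:]})


def holds(*atom):
    """Truth of a ground atom in the model I_D: it is true iff it is listed in D."""
    return atom in D


def q(x):
    """The original query (6), evaluated over D alone."""
    return any(
        holds("worksOn", x, y) and holds("involves", y, z) and holds("Professor", z)
        for y in CONSTANTS for z in CONSTANTS
    )


def q_prime(x):
    """The rewriting q' from Example 1, evaluated over D alone."""
    first = any(
        holds("worksOn", x, y)
        and (holds("worksOn", z, y) or holds("isManagedBy", y, z) or holds("involves", y, z))
        and holds("Professor", z)
        for y in CONSTANTS for z in CONSTANTS
    )
    second = any(holds("worksOn", x, y) and holds("Project", y) for y in CONSTANTS)
    return first or second or holds("Student", x)


answers_q = [x for x in CONSTANTS if q(x)]
answers_q_prime = [x for x in CONSTANTS if q_prime(x)]
```

Here answers_q is empty, while answers_q_prime contains exactly the constant c, in agreement with the discussion above.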

From the existence of the universal model (and the simplicity of its
structure) it can be deduced that for any q there is a rewriting q' having the
form of an (existentially quantified) disjunction of conjunctions of atoms.
However, it is not hard to provide an example in which this rewriting is
exponentially larger than q (see [?]). By the size of the rewriting we mean
the number of symbols in the formula.

So to obtain shorter rewritings it is helpful to consider more general types
of formulas. A natural choice would be to allow an arbitrary first-order
formula as a rewriting. This type is called a first-order rewriting, or an
FO-rewriting. Another option is a positive existential rewriting, or a
PE-rewriting. This is a special case of an FO-rewriting in which there are no
negations and there are only existential quantifiers. PE-rewritings are
preferable to FO-rewritings since they are more accessible to the algorithmic
machinery developed for usual databases. The size of a PE- or an FO-rewriting
is the number of symbols in the formula.

Another standard type of rewriting is a nonrecursive datalog rewriting,
or NDL-rewriting. This rewriting does not have the form of a first-order
formula and instead has the form of a DAG-representation of a first-order
formula. Namely, an NDL-rewriting consists of a set $\Pi$ of formulas of the
form

$$\forall x\, (A_1 \wedge \ldots \wedge A_n \to A_0),$$

where the $A_i$ are atomic formulas (possibly new, not present in the original
signature $\Sigma$), not necessarily of arity 1 or 2. Each $A_i$ depends on
(some of) the variables from x and each variable in $A_0$ must occur in
$A_1 \wedge \ldots \wedge A_n$. Finally, we need the acyclicity property of
$\Pi$. To define it, consider a directed graph whose vertices are the
predicates A of $\Pi$ and in which there is an edge from A to B iff there is
a formula in $\Pi$ which has B on the right-hand side and contains A on the
left-hand side. Now $\Pi$ is called acyclic if the resulting graph is acyclic.
An NDL-rewriting also contains a goal predicate G, and we say that a in
$\Delta_D$ satisfies $(\Pi, G)$ over the data D iff

$$D \cup \Pi \models G(a).$$

Thus, $(\Pi, G)$ is called an NDL-rewriting of the query q if

$$D \cup T \models q(a) \iff D \cup \Pi \models G(a)$$

for all D and all a. The size of an NDL-rewriting $(\Pi, G)$ is the number of
symbols in it.

Example 2. To illustrate the concept of NDL-rewriting we explicitly provide
a rewriting for the query q from Example 1:

$$\forall y, z\, (\mathit{worksOn}(z, y) \to N_1(y, z)),$$

$$\forall y, z\, (\mathit{isManagedBy}(y, z) \to N_1(y, z)),$$

$$\forall y, z\, (\mathit{involves}(y, z) \to N_1(y, z)),$$

$$\forall x, y, z\, (\mathit{worksOn}(x, y) \wedge N_1(y, z) \wedge \mathit{Professor}(z) \to G(x)),$$

$$\forall x, y\, (\mathit{worksOn}(x, y) \wedge \mathit{Project}(y) \to G(x)),$$

$$\forall x\, (\mathit{Student}(x) \to G(x)),$$

where $N_1$ is a new binary predicate and G is the goal predicate of this
NDL-rewriting.
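
To see how such a set of rules is evaluated over data, one can apply the rules repeatedly until no new atoms are derived. The following naive evaluator is a sketch under our own assumptions (rules are pairs (head, body) of atoms over variable names, and the predicate names N1 and G are spelled as plain strings); it is applied to the rules of Example 2 and the data D of Example 1.

```python
def match(body, facts, binding):
    """Yield all extensions of binding that map every atom of body to a fact."""
    if not body:
        yield binding
        return
    pred, *args = body[0]
    for fact in facts:
        if fact[0] != pred or len(fact) != len(body[0]):
            continue
        new = dict(binding)
        # Bind each variable to the corresponding constant, rejecting conflicts.
        if all(new.setdefault(v, c) == c for v, c in zip(args, fact[1:])):
            yield from match(body[1:], facts, new)


def evaluate_ndl(rules, data):
    """Return all ground atoms derivable from data by the rules (naive fixpoint)."""
    facts = set(data)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Materialize the matches first so that facts is not modified mid-iteration.
            for b in list(match(body, facts, {})):
                derived = (head[0],) + tuple(b[v] for v in head[1:])
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# The rules of Example 2.
rules = [
    (("N1", "y", "z"), [("worksOn", "z", "y")]),
    (("N1", "y", "z"), [("isManagedBy", "y", "z")]),
    (("N1", "y", "z"), [("involves", "y", "z")]),
    (("G", "x"), [("worksOn", "x", "y"), ("N1", "y", "z"), ("Professor", "z")]),
    (("G", "x"), [("worksOn", "x", "y"), ("Project", "y")]),
    (("G", "x"), [("Student", "x")]),
]
# The data D of Example 1.
D = {("Student", "c"), ("worksOn", "c", "b"), ("Project", "b"), ("isManagedBy", "b", "a")}
goal_answers = {f[1] for f in evaluate_ndl(rules, D) if f[0] == "G"}
```

Since the set of rules is acyclic, the fixpoint is reached after a number of rounds bounded by the length of the longest path in the dependency graph of predicates; on the data above the goal predicate holds exactly on c.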

It is not hard to see that this rewriting is similar to the PE-rewriting q'
from Example 1. Indeed, $N_1(y, z)$ is equivalent to the subformula

$$(\mathit{worksOn}(z, y) \vee \mathit{isManagedBy}(y, z) \vee \mathit{involves}(y, z))$$

of q' and $G(x)$ is equivalent to $q'(x)$.

It turns out that NDL-rewritings are more general than PE-rewritings.
Indeed, a PE-rewriting q' has the form $\exists y\, \varphi(x, y)$, where
$\varphi$ is a monotone Boolean formula applied to atomic formulas (note that
the existential quantifiers can be moved to the prefix due to the fact that
there are no negations in the formula). The formulas in $\Pi$ can model
$\vee$ and $\wedge$ operations and thus can model the whole formula
$\varphi$. For this, for each subformula of $\varphi$ we introduce a new
predicate symbol that depends on all variables on which this subformula
depends. We model $\vee$ and $\wedge$ operations on subformulas one by one.
In the end we will have an atom $F(x, y)$. Finally, we add to $\Pi$ the
formula

$$\forall x, y\, (F(x, y) \to G(x)). \qquad (7)$$

Then we have that, for any a, b, $\varphi(a, b)$ is true on D iff $F(a, b)$
is true over $D \cup \Pi$. Finally, $\exists y\, \varphi(a, y)$ is true over
D iff there is b among the constants such that $\varphi(a, b)$ is true. On the
other hand, in $\Pi$ we can deduce $G(a)$ iff there is b such that $F(a, b)$
is true. Thus, given a PE-rewriting, we can construct an NDL-rewriting of
approximately the same size.

It is unknown whether NDL-rewritings and FO-rewritings are comparable.
On the one hand, NDL-rewritings correspond to Boolean circuits, whereas
FO-rewritings correspond to Boolean formulas. On the other hand,
FO-rewritings can use negations and NDL-rewritings are monotone.

As we said above, we will consider only conjunctive queries $q(x)$ to
knowledge bases. However, in many cases queries have an even simpler
structure. To describe these restricted classes of queries, we have to
consider a graph underlying the query. The vertices of the graph are the
variables appearing in $q$. Two vertices are connected iff their labels
appear in the same atom of $q$. If this graph is a tree we call the query
tree-like. If the graph is a path, then we call the query linear.
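
As an illustration of these definitions, the following Python sketch builds
the graph underlying a conjunctive query and classifies the query. The
encoding of atoms as (predicate, variables) pairs and the function names are
assumptions made for this example only; self-loops such as $P(x, x)$ are
ignored, and the most specific class is returned (a path is also a tree).

```python
from itertools import combinations

def query_graph(atoms):
    """Adjacency sets of the graph underlying a query given as atoms."""
    adj = {}
    for _, variables in atoms:
        for v in variables:
            adj.setdefault(v, set())
        # Two variables are connected iff they occur in the same atom.
        for u, v in combinations(set(variables), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_connected(adj):
    """Depth-first search from an arbitrary vertex."""
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def classify(atoms):
    """Return "linear", "tree-like" or "general" for a conjunctive query."""
    adj = query_graph(atoms)
    edges = sum(len(n) for n in adj.values()) // 2
    # A graph is a tree iff it is connected and has |V| - 1 edges.
    if not (is_connected(adj) and edges == len(adj) - 1):
        return "general"
    # A tree is a path iff every vertex has degree at most 2.
    if all(len(n) <= 2 for n in adj.values()):
        return "linear"
    return "tree-like"

print(classify([("worksOn", ("x", "y")), ("N1", ("y", "z")), ("Professor", ("z",))]))
```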

4 Rewriting size lower bounds: general approach 

In this section we will describe the main idea behind the proofs of lower 
bounds on the size of query rewritings. 

Very informally, we encode Boolean functions inside queries in such
a way that the rewritings correspond to Boolean circuits computing these
functions. If we manage to encode a hard enough function, then there will
be no small circuits for it and thus there will be no small rewritings.

How exactly do we encode functions inside queries? First of all, we will
restrict ourselves to data $D$ with only one constant element $a$. This is
a substantial restriction on the data. But since our rewritings should work
for any data and we are proving lower bounds, this restriction can only
make our task harder. On the other hand, it makes our lower bounds more
general.

Next, we introduce several unary predicates $A_1, A_2, \dots, A_n$ and
consider the formulas $A_i(a)$. These predicates correspond to the Boolean
variables $x_1, \dots, x_n$ of the encoded function $f$: the variable $x_i$ is
true iff $A_i(a) \in D$. There are other predicates in the signature and
other formulas in $D$. Their role is to make sure that

$$D \cup T \models q(x)$$

iff the encoded function $f$ is true on the corresponding input.

This approach allows us to characterize the expressive power of various 
queries and theories. This characterization is summarized in the following 
table. 



                     depth 1         depth d > 1     arbitrary depth

linear queries       NC^1 [IS]       NL/poly 15      NL/poly [1]

tree-like queries    NC^1 ds]        SAC^1 m         NP/poly [n]

general queries      NL/poly [16]    NP/poly [III    NP/poly [m


The columns of the table correspond to the classes of the theories T. The 
rows of the table correspond to the classes of the queries q. An entry of the 
table represents the class of functions that can be encoded by queries and 
theories of these types. The results in the table give both upper and lower
bounds. However, in what follows we will concentrate on lower bounds, that
is, we will be interested in how to encode hard functions and we will not
discuss why harder functions cannot be encoded.

Next, we need to consider a rewriting of one of the types described above
and obtain from it the corresponding computational model computing $f$.
This connection is rather intuitive: rewritings have a structure very
similar to certain types of Boolean circuits. Namely, FO-rewritings are
similar to Boolean formulas, PE-rewritings are similar to monotone Boolean
formulas and NDL-rewritings are similar to monotone Boolean circuits. Thus,
a polynomial size FO-rewriting means that $f$ is in NC^1, a polynomial size
PE-rewriting means that $f$ is in mNC^1, and a polynomial size
NDL-rewriting means that $f$ is in mP/poly. We omit the proofs of these
reductions.

Together with the table above, this gives the whole spectrum of results
on the size of rewritings. We just need to use the known relations
between the corresponding complexity classes. For example, in the case of
depth 1 theories and linear or tree-like queries there are polynomial
rewritings of all three types. In the case of depth 2 theories and linear
or tree-like queries there are no polynomial PE-rewritings, there are no
polynomial FO-rewritings under a certain complexity-theoretic assumption,
but there are polynomial NDL-rewritings. In the case of depth 2 theories
and arbitrary queries there are no polynomial PE- and NDL-rewritings and
there are no polynomial FO-rewritings under a certain complexity-theoretic
assumption.

Below we provide further details of the proofs of the aforementioned
results. The paper im used an ad hoc construction to deal with the case of
unbounded depth and non-linear queries. Subsequent papers (min provided
a unified approach that uses the so-called hypergraph programs.

In the next section we proceed to the discussion of these programs. 


5 Hypergraph programs: origination 

For the sake of simplicity we will restrict ourselves to Boolean queries only.
Consider a query $q = \exists y\, \varphi(y)$ and consider its underlying graph $G$.
Vertices of $G$ correspond to the variables of $q$. Directed edges of $G$
correspond to binary predicates in $q$. Each edge $(u, v)$ is labeled by all
atomic formulas $P(u, v)$ in $q$. Each vertex $v$ is labeled by $A$ if $A(v)$ is
in $q$.

Let us consider data $D$. We can construct a universal model $M_D$ just by
adding universal trees to each element of $D$. Let us see how the query can
be satisfied by elements of the universal model. For this we need that for
each variable $t$ of the query we find a corresponding element in $M_D$
satisfying all the properties of $t$ stated in the query. This element in
$M_D$ can be an element of the data and can also be an element of a
universal tree.

Thus, for a query $q$ to be satisfied we need an embedding of it into the
universal model. That is, we should map the vertices of $G$ into the
vertices of the universal model $M_D$ in such a way that for each label in
$G$ there is a corresponding label in $M_D$. We call this embedding a
homomorphism.

Now let us see how a vertex $v$ of $G$ can be mapped into an inner element
$w$ of a universal tree $R$. This means that for all labels of $v$ the vertex
$w$ in the universal tree $R$ should have the same labels, and for all edges
adjacent to $v$ there should be corresponding edges adjacent to $w$ in the
universal tree. Thus all vertices adjacent to $v$ should also be mapped into
the universal tree $R$. We can repeat this argument for the neighbors of $v$
and proceed until we reach vertices of $G$ mapped into the root of $R$. So,
if one of the vertices of $G$ is embedded into a universal tree $R$, then so
is a set of neighboring vertices. The boundary of this set of vertices
should be mapped into the root of the universal tree.

Let us summarize what we have now. An answer to a query corresponds
to an embedding of $G$ into the universal model $M_D$. There are connected
induced subgraphs in $G$ that are embedded into universal trees. The
boundaries of these subgraphs (the vertices connected to the outside
vertices) are mapped into the root of the universal tree. Two subgraphs can
intersect only by boundary vertices. These subgraphs are called tree
witnesses.

Given a query we can find all possible tree witnesses in it. Then, for any
given data $D$ there is an answer to the query if we can map the query into
the universal model $M_D$. There is such a mapping if we can find a set of
disjoint tree witnesses such that we can map all other vertices into $D$ and
the tree witnesses into the corresponding universal trees.

Now assume for simplicity that there is only one element $a$ in $D$. Thus
$D$ consists of formulas $A(a)$ and $P(a, a)$. To decide whether there is an
answer to a query we need to check whether there is a set of tree witnesses
which do not intersect (except by boundary vertices), such that all vertices
except the inner vertices of tree witnesses can be mapped into $a$. Consider
the following hypergraph $H$: it has a vertex for each vertex of $G$ and for
each edge of $G$; for each tree witness there is a hyperedge in $H$
consisting of the vertices corresponding to the inner vertices of the tree
witness and of the vertices corresponding to the edges of the tree witness.
For each vertex $v$ of the hypergraph $H$ let us introduce a Boolean variable
$x_v$, and for each hyperedge $e$ of the hypergraph $H$ a Boolean variable
$x_e$. For a given $D$ (with one element $a$) let $x_v$ be equal to 1 iff $v$
can be mapped into $a$, and let $x_e$ be equal to 1 iff the unary predicate
generating the tree witness corresponding to the hyperedge $e$ is true on
$a$. From the discussion above it follows that there is an answer to the
query for a given $D$ iff there is a subset of disjoint hyperedges such that
$x_e = 1$ for them and they contain all vertices with $x_v = 0$.

This leads us to the following definition. 

Definition 2 (Hypergraph program). A hypergraph program $H$ is a
hypergraph whose vertices are labeled by Boolean variables $x_1, \dots, x_n$,
their negations or Boolean constants 0 and 1. A hypergraph program $H$
outputs 1 on input $x \in \{0,1\}^n$ iff there is a set of disjoint hyperedges
covering all vertices whose labels evaluate to 0. We denote this by
$H(x) = 1$. A hypergraph program computes a Boolean function
$f\colon \{0,1\}^n \to \{0,1\}$ iff for all $x \in \{0,1\}^n$ we have $H(x) = f(x)$. The
size of a hypergraph program is the number of vertices plus the number of
hyperedges in it. A hypergraph program is monotone iff there are no negated
variables among its labels.
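
To make the definition concrete, here is a minimal Python sketch of an
evaluator for hypergraph programs. The data representation (a dictionary of
labels, a list of vertex sets) and the function name are our own
assumptions; the search is a plain backtracking procedure and may take
exponential time.

```python
def evaluate(labels, hyperedges, x):
    """Output of a hypergraph program on input x (a dict: variable -> 0/1).

    labels:     dict vertex -> 0, 1, ("var", name) or ("neg", name)
    hyperedges: list of sets of vertices
    """
    def value(label):
        if label in (0, 1):
            return label
        kind, name = label
        return x[name] if kind == "var" else 1 - x[name]

    zeros = [v for v, lab in labels.items() if value(lab) == 0]

    def cover(used):
        # Pick a zero-labelled vertex that is not covered yet.
        pending = next((v for v in zeros if v not in used), None)
        if pending is None:
            return True
        # It must lie in a hyperedge disjoint from all hyperedges chosen so far.
        return any(cover(used | e) for e in hyperedges
                   if pending in e and not (e & used))

    return 1 if cover(frozenset()) else 0

# A small monotone program; with these labels it computes x1 OR x2.
labels = {"w": 0, "a1": 1, "u1": ("var", "x1"), "a2": 1, "u2": ("var", "x2")}
hyperedges = [{"w", "a1"}, {"a1", "u1"}, {"w", "a2"}, {"a2", "u2"}]
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, evaluate(labels, hyperedges, {"x1": x1, "x2": x2}))
```

In the example the vertex w must always be covered, and the hyperedge
{w, a1} can be used only if {a1, u1} is not needed, that is, only if x1 = 1
(and symmetrically for x2).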

Remark. Note that in the discussion above we obtained a somewhat different
model. Namely, there were also variables associated with the hyperedges of
the hypergraph. Note, however, that our definition also captures this
extended model. Indeed, we can introduce for each hyperedge $e$ a pair of
fresh vertices $v_e$ and $u_e$ and a new hyperedge $e'$. We add $v_e$ to the
hyperedge $e$ and we let $e' = \{v_e, u_e\}$. The label of $v_e$ is 1 and the
label of $u_e$ is the variable $x_e$. It is easy to see that $x_e = 0$ iff we
cannot use the hyperedge $e$ in our cover.
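
The construction from the remark can be written down directly. The Python
sketch below is illustrative: the representation of programs and the names
of the fresh vertices, ("v", i) and ("u", i), are our assumptions, and the
fresh names are assumed not to clash with existing vertices.

```python
def absorb_edge_variables(labels, hyperedges, edge_vars):
    """Replace hyperedge variables by vertex labels, as in the remark.

    labels:     dict vertex -> label of the original vertices
    hyperedges: list of sets of vertices
    edge_vars:  edge_vars[i] is the name of the variable of hyperedges[i]
    Returns (labels, hyperedges) of an ordinary hypergraph program.
    """
    new_labels = dict(labels)
    new_edges = []
    for i, (edge, var) in enumerate(zip(hyperedges, edge_vars)):
        v_e, u_e = ("v", i), ("u", i)         # two fresh vertices
        new_labels[v_e] = 1                    # never has to be covered
        new_labels[u_e] = ("var", var)         # has to be covered iff var = 0
        new_edges.append(set(edge) | {v_e})    # e is extended by v_e
        new_edges.append({v_e, u_e})           # e' intersects e in v_e
    return new_labels, new_edges

new_labels, new_edges = absorb_edge_variables({"w": 0}, [{"w"}], ["xe"])
print(new_labels)
print(new_edges)
```

If the variable of $e$ equals 0, the vertex $u_e$ can only be covered by
$e'$, which intersects $e$, so $e$ is excluded from any cover; the
construction adds two vertices and one hyperedge per original hyperedge.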

So far we have discussed how to encode a Boolean function by a query
and a theory. We have noted that the resulting function is computable by a
hypergraph program. We denote by HGP the class of functions computable
by hypergraph programs of polynomial size (recall that we actually consider
sequences of functions and sequences of programs).

Various restrictions on queries and theories result in restricted versions
of hypergraph programs. If a theory is of depth 1, then each tree witness
has one inner vertex and thus two different hyperedges can intersect only in
one vertex corresponding to an edge of $G$. Thus each vertex corresponding
to an edge of $G$ can occur in at most two hyperedges and the resulting
hypergraph program is of degree at most 2. We denote by $\mathrm{HGP}_k$ the set
of functions computable by polynomial size hypergraph programs of degree at
most $k$.
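
Here the degree of a hypergraph program is read as the maximal number of
hyperedges containing a single vertex, which is how the paragraph above
uses the term. A one-function Python sketch (the representation and the
function name are assumptions for illustration):

```python
def degree(vertices, hyperedges):
    """Maximal number of hyperedges that share a single vertex."""
    return max((sum(v in e for e in hyperedges) for v in vertices), default=0)

vertices = ["w", "a1", "u1", "a2", "u2"]
hyperedges = [{"w", "a1"}, {"a1", "u1"}, {"w", "a2"}, {"a2", "u2"}]
print(degree(vertices, hyperedges))
```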

If a query is tree-like (or linear), then the hypergraph program will have
an underlying tree (or path) structure and all hyperedges will be its
subtrees (subpaths). We denote by $\mathrm{HGP}_{tree}$ ($\mathrm{HGP}_{path}$) the set of
functions computable by hypergraph programs of polynomial size and with
underlying tree (path) structure.

However, to prove lower bounds we need to show that any hypergraph
program in a certain class can be encoded by a query and a theory of the
corresponding type. These statements are proved separately by various
constructions of queries and theories. We will describe a construction for
general hypergraph programs as an example.

Consider a hypergraph program $P$ and consider its underlying hypergraph
$H = (V, E)$. It is more convenient to consider a more general
hypergraph program $P'$ which has the same underlying hypergraph $H$ and in
which each vertex $v \in V$ is labeled by a variable $x_v$. Clearly, the
function computed by $P$ can be obtained from the function computed by $P'$
by fixing some variables to constants and identifying some variables
(possibly with negations). Thus it is enough to encode in a query and a
theory the function computed by $P'$. We denote this function by $f$.

To construct a theory and a query encoding $f$ consider the following
directed graph $G$. It has a vertex $z_v$ for each vertex $v$ of the
hypergraph $H$ and a vertex $z_e$ for each hyperedge $e$ of the hypergraph
$H$. The set of edges of $G$ consists of the edges $(z_v, z_e)$ for all pairs
$(v, e)$ such that $v \in e$. This graph will be the underlying graph of the
query. For each vertex $z_e$ the subgraph induced by all vertices at
distance at most 2 from $z_e$ will be a tree witness. In other words, this
tree witness contains the vertices $z_v$ for all $v \in e$ and $z_{e'}$ for all
$e'$ such that $e' \cap e \neq \emptyset$. The latter vertices are the boundary
vertices of the tree witness.

The signature contains unary predicates $A_v$ for all $v \in V$, unary
predicates $A_e$, $B_e$ and binary predicates $R_e$ for all $e \in E$.
Intuitively, the predicate $A_e$ generates the tree witness corresponding to
$z_e$, the predicate $B_e$ encodes that its input corresponds to $z_v$ with
$v \in e$, the predicate $R_e$ encodes that its inputs correspond to
$(z_e, z_v)$ and $v \in e$, and the predicate $A_v$ encodes the variable $x_v$
of $f$.

Our Boolean query $q$ consists of the atomic formulas

$$\{A_v(z_v) \mid v \in V\} \cup \{R_e(z_e, z_v) \mid v \in e, \text{ for } v \in V \text{ and } e \in E\}.$$

Here $z_v$ and $z_e$ for all $v \in V$ and $e \in E$ are existentially quantified
variables of the query.
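
The query is determined by the hypergraph alone, which the following Python
sketch makes explicit. The string rendering of predicates and variables
(A_v, R_e, z_v, z_e) and the function name are assumptions made for this
illustration; names of vertices and hyperedges are assumed to be distinct.

```python
def query_atoms(vertices, hyperedges):
    """Atoms of the Boolean query q built from a hypergraph H = (V, E).

    vertices:   iterable of vertex names
    hyperedges: dict mapping a hyperedge name e to its set of vertices
    Atoms are returned as (predicate, arguments) pairs.
    """
    # One unary atom A_v(z_v) for every vertex v.
    atoms = [(f"A_{v}", (f"z_{v}",)) for v in vertices]
    # One binary atom R_e(z_e, z_v) for every pair v in e.
    for e, members in hyperedges.items():
        for v in sorted(members):
            atoms.append((f"R_{e}", (f"z_{e}", f"z_{v}")))
    return atoms

for atom in query_atoms(["1", "2"], {"e": {"1", "2"}}):
    print(atom)
```

The number of atoms is the number of vertices plus the total size of the
hyperedges, so the query is of size polynomial in the size of the
hypergraph program.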

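Theory $T$ consists of the following formulas (the variable $x$ is universally
quantified):

$$A_e(x) \to \exists y \Big( \bigwedge_{\substack{e \cap e' \neq \emptyset \\ e \neq e'}} R_{e'}(x, y) \wedge B_e(y) \Big), \qquad B_e(x) \to \bigwedge_{v \in e} A_v(x), \qquad B_e(x) \to \exists y\, R_e(y, x).$$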

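In particular, each predicate $A_e$ generates a universal tree of depth 2
consisting of 3 vertices $a$, $w^e_{vertex}$, $w^e_{edge}$ and of the following
predicates ($a$ is the root of the universal tree):

$A_e(a)$,

$R_{e'}(a, w^e_{vertex})$ for $e' \neq e$, $e' \cap e \neq \emptyset$,

$B_e(w^e_{vertex})$,

$A_v(w^e_{vertex})$ for all $v \in e$,

$R_e(w^e_{edge}, w^e_{vertex})$.

[Figure: the universal tree generated by $A_e$. The root $a$ is labeled $A_e$;
an edge labeled $R_{e'}$ leads from $a$ to $w^e_{vertex}$, which is labeled
$B_e$, $A_v$; an edge labeled $R_e$ leads from $w^e_{edge}$ to $w^e_{vertex}$.]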

There are other universal trees generated by the predicates $B_e$, but we
will consider only data in which the $B_e$ are not present, so the
corresponding universal trees will not be present in the universal model
either.

There is one constant $a$ in our data and we will restrict ourselves only to
data containing $A_e(a)$ for all $e$ and $R_e(a, a)$ for all $e$ and not
containing $B_e$ for any $e$. For convenience denote
$D_0 = \{A_e(a), R_e(a, a) \text{ for all } e \in E\}$. The predicates $A_v$ will
correspond to the variables $x_v$ of the function $f$. That is, the following
claim holds.

Claim 1. For all $x \in \{0,1\}^n$, $f(x) = 1$ iff $D \cup T \models q$ for
$D = D_0 \cup \{A_v(a) \mid x_v = 1\}$.

Proof. Note first that if $A_v(a)$ is true for all $v$ then the query is
satisfiable. We can just map all vertices $z_e$ and $z_v$ to $a$. However, if
some predicate $A_v(a)$ is not present, then we cannot map $z_v$ to $a$ and
have to use universal trees.

Suppose $f(x) = 1$ for some $x \in \{0,1\}^n$ and consider the corresponding
data $D$. There is a subset of hyperedges $E' \subseteq E$ of $H$ such that the
hyperedges in $E'$ do not intersect and all $v \in V$ such that $x_v = 0$ lie in
hyperedges of $E'$. Then we can satisfy the query in the following way. We
map the vertices $z_e$ with $e \notin E'$ to $a$. We map all vertices $z_v$ such
that $v$ is not contained in hyperedges of $E'$ also into $a$. If for $z_v$ we
have $v \in e$ for $e \in E'$, then we send $z_v$ to the vertex $w^e_{vertex}$ of
the universal tree generated by $A_e$. Finally, we send the vertices $z_e$
with $e \in E'$ to the vertex $w^e_{edge}$ of the same universal tree. It is easy
to see that all predicates in the query are satisfied.

In the other direction, suppose for data $D$ the query $q$ is true. It means
that there is a mapping of the variables $z_v$ and $z_e$ for all $v$ and $e$
into the universal model $M_D$. Note that the vertex $z_e$ can be sent either
to $a$, or to the vertex $w^e_{edge}$ of the universal tree generated by $A_e$.
Indeed, only these vertices of $M_D$ have an outgoing edge labeled by $R_e$.
Consider the set $E' = \{e \in E \mid z_e \text{ is sent to } w^e_{edge}\}$.
Consider some $e \in E'$ and note that for any $e'$ such that $e' \neq e$ and
$e' \cap e \neq \emptyset$, $z_{e'}$ is at distance 2 from $z_e$ in $G$ and $z_{e'}$
should be mapped into $a$. Thus the hyperedges in $E'$ are non-intersecting.
If for some $z_v$ the atom $A_v(a)$ is not in $D$, then $z_v$ cannot be mapped
into $a$. Thus it is mapped into the vertex $w^e_{vertex}$ for some $e$
containing $v$. But then $z_e$ should be mapped into $w^e_{edge}$ in the same
universal tree (there is only one edge leaving $w^e_{vertex}$ labeled by
$R_e$). Thus $e \in E'$ and thus $v$ is covered by the hyperedges of $E'$.
Overall, we have that the hyperedges in $E'$ give a disjoint cover of all
zeros in $P'$ and thus $f(x) = 1$.

□ 


6 Hypergraph programs: complexity 

We have discussed that hypergraph programs can be encoded by queries and
theories. Now we need to show that there are hard functions computable by
hypergraph programs. For this we will determine the power of various types
of hypergraph programs. Then the existence of hard functions will follow
from known results in complexity theory.

We formulate the results on the complexity of hypergraph programs in 
the following theorem. 

Theorem 3 f[TBl H]l. The following equations hold both in monotone and 
non-monotone cases: 

1. $\mathrm{HGP} = \mathrm{HGP}_3 = \mathrm{NP/poly}$;

2. $\mathrm{HGP}_2 = \mathrm{NBP}$;

3. $\mathrm{HGP}_{path} = \mathrm{NBP}$;

4. $\mathrm{HGP}_{tree} = \mathrm{SAC}^1$.

Together with the discussion of the two previous sections this theorem
gives the whole picture of the proofs of lower bounds on the rewriting size
for the considered types of queries and theories.

We do not give a complete proof of Theorem 3 here, but in order to
present the ideas behind it, we give a proof of the first part of the theorem.

Proof. Clearly, $\mathrm{HGP}_3 \subseteq \mathrm{HGP}$.

Next, we show that $\mathrm{HGP} \subseteq \mathrm{NP/poly}$. Suppose we have a hypergraph
program of size $m$ with variables $x$. We construct a circuit $C(x, y)$ of
size $\mathrm{poly}(m)$ satisfying (1). Its $x$-variables are precisely the variables
of the program, and the certificate variables $y$ correspond to the
hyperedges of the program. The circuit $C$ will output 1 on $(x, y)$ iff the
family $\{e \mid y_e = 1\}$ of hyperedges of the hypergraph forms a disjoint set
of hyperedges covering all vertices labeled by 0 under $x$. It is easy to
construct a polynomial size circuit checking this property. Indeed, for
each pair of intersecting hyperedges $(e, e')$ it is enough to compute the
disjunction $\neg y_e \vee \neg y_{e'}$, and for each vertex $v$ of the hypergraph
with label $t$ and contained in hyperedges $e_1, \dots, e_k$ it is enough to
compute the disjunction $t \vee y_{e_1} \vee \dots \vee y_{e_k}$. It then remains to
compute a conjunction of these disjunctions. It is easy to see that this
construction also works in the monotone case (note that applications of
$\neg$ to $y$-variables in the monotone counterpart of NP/poly are allowed).
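
The verifier described in this paragraph is a conjunction of clauses and
can be sketched as follows. The Python code below is an illustration under
our own conventions (labels are 0, 1, ("var", name) or ("neg", name);
hyperedges form a list of sets; y is a tuple of bits, one per hyperedge);
the exhaustive search over y in program_output only mimics the existential
quantifier and is exponential.

```python
from itertools import combinations, product

def check_certificate(labels, hyperedges, x, y):
    """C(x, y): do the hyperedges selected by y form a disjoint cover of zeros?"""
    def value(label):
        if label in (0, 1):
            return label
        kind, name = label
        return x[name] if kind == "var" else 1 - x[name]

    # Clause (not y_e) or (not y_e') for every pair of intersecting hyperedges.
    for i, j in combinations(range(len(hyperedges)), 2):
        if hyperedges[i] & hyperedges[j] and y[i] and y[j]:
            return 0
    # Clause t or y_{e_1} or ... or y_{e_k} for every vertex with label t.
    for v, label in labels.items():
        selected = any(y[i] for i, e in enumerate(hyperedges) if v in e)
        if not (value(label) or selected):
            return 0
    return 1

def program_output(labels, hyperedges, x):
    """f(x) = 1 iff some certificate y is accepted by the verifier."""
    return int(any(check_certificate(labels, hyperedges, x, y)
                   for y in product((0, 1), repeat=len(hyperedges))))

labels = {"w": 0, "a1": 1, "u1": ("var", "x1"), "a2": 1, "u2": ("var", "x2")}
hyperedges = [{"w", "a1"}, {"a1", "u1"}, {"w", "a2"}, {"a2", "u2"}]
print(program_output(labels, hyperedges, {"x1": 1, "x2": 0}))
```

The number of clauses is at most quadratic in the number of hyperedges plus
the number of vertices, which matches the polynomial size bound claimed
for $C$.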

Now we show that $\mathrm{NP/poly} \subseteq \mathrm{HGP}_3$. Consider a function
$f \in \mathrm{NP/poly}$ and consider a circuit $C(x, y)$ satisfying (1). Let
$g_1, g_2, \dots$ be the gates of $C$ (including the inputs $x$ and $y$). We
construct a hypergraph program of degree $\le 3$ computing $f$ of size
polynomial in the size of $C$. For each $i$ we introduce a vertex $g_i$
labelled with 0 and a pair of hyperedges $e_{g_i}$ and $\bar{e}_{g_i}$, both
containing $g_i$. No other hyperedge contains $g_i$, and so either $e_{g_i}$
or $\bar{e}_{g_i}$ should be present in any cover of zeros in the hypergraph
program. Intuitively, if the gate $g_i$ evaluates to 1 then $e_{g_i}$ is in
the cover, otherwise $\bar{e}_{g_i}$ is there. To ensure this property for each
input variable $x_i$, we add a new vertex $v_i$ labelled with $\neg x_i$ to
$e_{x_i}$ and a new vertex $u_i$ labelled with $x_i$ to $\bar{e}_{x_i}$.
For a non-variable gate $g_i$, we consider three cases.

- If $g_i = \neg g_j$ then we add a vertex labelled with 1 to $e_{g_i}$ and
  $e_{g_j}$, and a vertex labelled with 1 to $\bar{e}_{g_i}$ and $\bar{e}_{g_j}$.

- If $g_i = g_j \vee g_{j'}$ then we add a vertex labelled with 1 to $e_{g_j}$
  and $\bar{e}_{g_i}$, add a vertex labelled with 1 to $e_{g_{j'}}$ and
  $\bar{e}_{g_i}$; then, we add vertices $h_j$ and $h_{j'}$ labelled with 1 to
  $\bar{e}_{g_j}$ and $\bar{e}_{g_{j'}}$, respectively, and a vertex $w_i$
  labelled with 0 to $\bar{e}_{g_i}$; finally, we add hyperedges $\{h_j, w_i\}$
  and $\{h_{j'}, w_i\}$.

- If $g_i = g_j \wedge g_{j'}$ then we use the dual construction.

In the first case it is not hard to see that $e_{g_i}$ is in the cover iff
$\bar{e}_{g_j}$ is in the cover. In the second case $e_{g_i}$ is in the cover iff
at least one of $e_{g_j}$ and $e_{g_{j'}}$ is in the cover. Indeed, in the second
case if, say, the cover contains $e_{g_j}$ then it cannot contain
$\bar{e}_{g_i}$ and so it contains $e_{g_i}$. The vertex $w_i$ in this case can
be covered by the hyperedge $\{h_j, w_i\}$ since $\bar{e}_{g_j}$ is not in the
cover. Conversely, if neither $e_{g_j}$ nor $e_{g_{j'}}$ is in the cover, then
it must contain both $\bar{e}_{g_j}$ and $\bar{e}_{g_{j'}}$, and so neither
$\{h_j, w_i\}$ nor $\{h_{j'}, w_i\}$ can belong to the cover and we will have to
include $\bar{e}_{g_i}$ in the cover. Finally, we add one more vertex labelled
with 0 to $e_g$ for the output gate $g$ of $C$. It is not hard to show that,
for each $x$, there is $y$ such that $C(x, y) = 1$ iff the constructed
hypergraph program returns 1 on $x$.
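
For small circuits the construction can be checked mechanically. The Python
sketch below follows our reading of the proof; the circuit encoding, the
vertex names ("gate", g) and ("aux", k), and both function names are
assumptions made for illustration, and the case of conjunction is
implemented as the dual of the case of disjunction (the roles of $e_g$ and
$\bar{e}_g$ are swapped). The evaluator is a simple backtracking search.

```python
from itertools import count

def circuit_to_program(gates, output):
    """Build a hypergraph program from a circuit C(x, y).

    gates maps a gate name to ("x",), ("y",), ("not", j), ("or", j, k)
    or ("and", j, k); output is the name of the output gate.
    """
    labels, extra, fresh = {}, [], count()
    pos = {g: {("gate", g)} for g in gates}   # e_g: "g evaluates to 1"
    neg = {g: {("gate", g)} for g in gates}   # bar e_g: "g evaluates to 0"

    def add_vertex(label, *edges):
        """Create a fresh vertex with this label inside each given hyperedge."""
        v = ("aux", next(fresh))
        labels[v] = label
        for e in edges:
            e.add(v)
        return v

    for g, gate in gates.items():
        labels[("gate", g)] = 0               # exactly one of e_g, bar e_g is used
        if gate[0] == "x":
            add_vertex(("neg", g), pos[g])    # x = 1 forces e_g
            add_vertex(("var", g), neg[g])    # x = 0 forces bar e_g
        elif gate[0] == "not":
            add_vertex(1, pos[g], pos[gate[1]])
            add_vertex(1, neg[g], neg[gate[1]])
        elif gate[0] in ("or", "and"):
            # "and" is handled dually: the roles of e and bar e are swapped.
            on, off = (pos, neg) if gate[0] == "or" else (neg, pos)
            w = add_vertex(0, off[g])
            for j in gate[1:]:
                add_vertex(1, on[j], off[g])
                extra.append({add_vertex(1, off[j]), w})
    add_vertex(0, pos[output])                # the output gate must evaluate to 1
    return labels, list(pos.values()) + list(neg.values()) + extra

def evaluate(labels, hyperedges, x):
    """1 iff some set of disjoint hyperedges covers all zero-labelled vertices."""
    def value(label):
        if label in (0, 1):
            return label
        return x[label[1]] if label[0] == "var" else 1 - x[label[1]]
    zeros = [v for v, lab in labels.items() if value(lab) == 0]
    def cover(used):
        pending = next((v for v in zeros if v not in used), None)
        if pending is None:
            return True
        return any(cover(used | e) for e in hyperedges
                   if pending in e and not (e & used))
    return 1 if cover(frozenset()) else 0

# C(x1, x2, y) = (x1 AND y) OR (x2 AND NOT y), so that f(x1, x2) = x1 OR x2.
gates = {"x1": ("x",), "x2": ("x",), "y": ("y",), "n": ("not", "y"),
         "a": ("and", "x1", "y"), "b": ("and", "x2", "n"), "o": ("or", "a", "b")}
labels, hyperedges = circuit_to_program(gates, "o")
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, evaluate(labels, hyperedges, {"x1": x1, "x2": x2}))
```

Every vertex created by this sketch lies in at most three hyperedges (the
vertices $w_i$ attain this bound), in agreement with the degree claimed in
the proof.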

For the monotone case, we remove all vertices labelled with $\neg x_i$. Then,
for an input $x$, there is a cover of zeros in the resulting hypergraph program
iff there are $y$ and $x' \le x$ with $C(x', y) = 1$. □